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This  paper  describes  a  unified,  element  based  Galerkin  (EBG)  framework  for  a  three-dimen¬ 
sional,  nonhydrostatic  model  for  the  atmosphere.  In  general,  EBG  methods  possess  high- 
order  accuracy,  geometric  flexibility,  excellent  dispersion  properties  and  good  scalability. 
Our  nonhydrostatic  model,  based  on  the  compressible  Euler  equations,  is  appropriate  for 
both  limited-area  and  global  atmospheric  simulations.  Both  a  continuous  Galerkin  (CG), 
or  spectral  element,  and  discontinuous  Galerkin  (DG)  model  are  considered  using  hexahe- 
dral  elements.  The  formulation  is  suitable  for  both  global  and  limited-area  atmospheric 
modeling,  although  we  restrict  our  attention  to  3D  limited-area  phenomena  in  this  study; 
global  atmospheric  simulations  will  be  presented  in  a  follow-up  paper.  Domain  decompo¬ 
sition  and  communication  algorithms  used  by  both  our  CG  and  DG  models  are  presented. 
The  communication  volume  and  exchange  algorithms  for  CG  and  DG  are  compared  and 
contrasted.  Numerical  verification  of  the  model  was  performed  using  two  test  cases:  flow 
past  a  3D  mountain  and  buoyant  convection  of  a  bubble  in  a  neutral  atmosphere;  these 
tests  indicate  that  both  CG  and  DG  can  simulate  the  necessary  physics  of  dry  atmospheric 
dynamics.  Scalability  of  both  methods  is  shown  up  to  8192  CPU  cores,  with  near  ideal  scal¬ 
ing  for  DG  up  to  32,768  cores. 
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1.  Introduction 

As  the  resolution  of  numerical  weather  prediction  (NWP)  models  increase,  nonhydrostatic  effects  become  relevant. 
Almost  all  current  limited-area  models  use  a  nonhydrostatic  core,  while  global  models  are  currently  transitioning  from 
the  hydrostatic  to  the  nonhydrostatic  regime.  A  host  of  challenging,  non-trivial  numerical  problems  arise  when  one  enters 
the  nonhydrostatic  regime:  these  include  (1)  choosing  the  appropriate  equation  set  to  ensure  efficiency,  accuracy,  and  con¬ 
servation,  (2)  effectively  resolving  multi-scale  flow  features,  that  may  require  adaptive  mesh  refinement  (AMR),  (3)  devel¬ 
oping  efficient  time-integrators  and/or  “soundproof’  [24,31]  equation  sets  to  confront  the  fast  acoustic  and  gravity 
waves,  and  (4)  developing  scalable  parallel  codes  for  shared,  distributed,  and  hybrid  architectures  based  on  items  (l)-(3). 

Virtually  all  current  nonhydrostatic  NWP  models  are  based  on  a  combination  of  finite-difference  discretization  in  space 
and  either  split-explicit  or  semi-implicit  (i.e.,  implicit-explicit  or  1MEX)  discretization  in  time.  Examples  include  WRF  [25] 
(NCAR),  Lokal  Modell  [33]  (DWD),  COAMPS  [20]  (US  Navy),  and  UM  [4]  (UK  Met  Office).  Although  finite  difference  schemes 
are  very  efficient,  they  may  suffer  from  several  problems,  including:  dispersion  error  (due  to  low-order  approximations), 
geometric  inflexibility,  and  lack  of  scalability  to  large  (e.g.  tens  of  thousands)  of  processors.  To  overcome  some  of  the  lim¬ 
itations  of  finite-difference  approximations,  several  emerging  NWP  models  use  finite-volume  spatial  discretization,  such 
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as  MPAS  [40]  and  MCORE  [41],  which  employ  polynomial  reconstruction  of  the  inter-element  fluxes.  In  order  to  achieve 
high-order  accuracy,  these  finite-volume  based  methods  use  a  larger  halo,  which  may  impede  scalability  at  high  processor 
counts.  Other  approaches  within  a  finite  volume  framework  include  the  flux-based  characteristic  semi-Lagrangian  (FBCSL) 
method  [30],  which  uses  the  method  of  characteristics  to  enable  larger  time-steps  that  may  present  an  obstacle  to  scalability. 

Alternatively,  element-based  Galerkin  (EBG)  methods  have  recently  been  proposed  for  several  next  generation  non¬ 
hydrostatic  NWP  models  [15,16,32],  as  well  as  for  hydrostatic  models  using  both  continuous  Galerkin  (CG)  [8,7]  and  dis¬ 
continuous  Galerkin  (DG)  [29]  formulations.  We  note  that  the  hydrostatic  models  are  only  valid  for  horizontal  resolutions 
coarser  than  10  km  and  for  this  reason  are  not  viable  options  for  mesoscale  (regional)  modeling.  EBG  methods  possess  sev¬ 
eral  desirable  attributes  for  future  NWP  nonhydrostatic  models,  such  as  high-order  accuracy,  geometric  flexibility,  whereby 
the  solver  is  completely  independent  of  the  grid,  excellent  dispersion  properties  [27],  and  minimal  communication  over¬ 
head  within  a  parallel  implementation.  High-order  accuracy,  which  allows  fine-scale  atmospheric  flow  features  to  be 
resolved,  is  achieved  by  representing  the  prognostic  variables  via  a  polynomial  basis  function  expansion  within  each  ele¬ 
ment.  Geometric  flexibility,  which  is  inherent  to  all  EBGs  (both  low  and  high  order),  is  advantageous  since  any  terrain-fol¬ 
lowing  coordinate  may  be  used  within  the  same  solver;  also,  both  static  and  dynamic  adaptivity  may  be  retro-fitted  to  the 
existing  solver. 

Finally,  in  a  distributed  memory  environment  (e.g.  cluster),  low  communication  overhead  is  critical  for  achieving  linear 
scalability  to  hundreds  of  thousands  of  processor  cores.  In  our  previous  work,  high-order  accuracy  and  geometric  flexibility 
were  addressed  within  explicit  [15]  and  IMEX  [16]  frameworks  for  2D  (x-z  slices)  problems  using  a  serial  implementation. 
However,  parallel  implementation  was  not  explicitly  addressed.  The  purpose  of  the  present  work  is  to  extend  the  work 
begun  in  [15,32,16]  to  realistic  three-dimensional  (3D)  domains  using  a  parallel,  MPI-based  implementation. 

The  present  work  is  guided  by  a  need  for  highly  scalable  models  in  distributed  memory  environments.  In  the  past  five 
years,  clock  speeds  of  processors  have  remained  stagnant;  to  achieve  increased  floating-point  performance,  chip  manufac¬ 
turers  have  developed  multiple  core  processors  that  allow  many  threads  to  execute  in  parallel.  In  tandem,  high  performance 
computing  (HPC)  has  evolved  towards  clusters  with  processors  counts  exceeding  100,000.  As  we  approach  the  exaflop  era, 
core  counts  are  expected  to  approach  1,000,000.  Therefore,  next-generation  NWP  models  must  be  based  upon  scalable 
numerical  methods  that  allow  arbitrarily  large  processor  counts  with  minimal  communication  overhead.  EBG  methods,  both 
CG  and  DG,  have  proved  effective  in  this  respect  in  modeling  biological  flow  [18]  (DG),  the  hydrostatic  atmosphere  [12,17] 
(CG),  incompressible  flow  [21]  (low-order  CG),  and  geodynamical  problems  [42]  (DG). 

This  paper  presents  a  scalable,  3D  nonhydrostatic  atmospheric  model  based  on  a  unified  element  based  Galerkin  (EBG) 
method  targeted  toward  distributed  memory  architectures;  to  our  knowledge,  these  are  the  first  3D  continuous  Galerkin 
(spectral  element)  and  discontinuous  Galerkin  models  for  a  nonhydrostatic  atmosphere.  We  are  developing  a  unified  dynam¬ 
ical  core  appropriate  for  both  mesoscale  and  global  simulations;  this  nonhydrostatic  unified  model  of  the  atmosphere  we  call 
NUMA.  The  current  implementation  uses  tensor  products  of  Lagrange  polynomials  in  a  hexahedral  grid  for  maximum  com¬ 
putational  efficiency;  however,  more  flexible  grids  based  on  either  tetrahedral  or  triangular  prisms  may  be  incorporated  into 
future  versions.  The  remainder  of  this  paper  is  structured  as  follows.  In  Section  2,  we  formulate  the  nonhydrostatic  com¬ 
pressible  Euler  equations  (set  2NC  and  2C  from  [16]),  which  constitute  the  governing  equations  of  our  dynamical  core.  To 
ensure  a  well-balanced  method,  these  nonhydrostatic  equations  are  solved  about  a  hydrostatic  base  state.  In  Section  3, 
we  present  both  the  continuous  and  discontinuous  Galerkin  discretization,  along  with  the  explicit  time-integrator,  boundary 
conditions,  and  artificial  diffusion.  Section  4,  which  forms  the  core  of  the  paper,  outlines  the  parallelization  algorithm  for 
both  the  CG  and  DG  methods.  The  communication  volume  and  memory  access  patterns  of  both  CG  and  DG  are  compared 
in  this  section.  Numerical  results  for  a  3D  linear  hydrostatic  mountain  and  a  3D  rising  thermal  bubble  are  shown  in  Section 
5  along  with  the  results  of  the  scalability  experiments.  Perfect  scalability  is  demonstrated  to  8192  processor  cores  for  both 
the  CG  and  DG  methods,  with  near  ideal  scaling  for  DG  up  to  32,768  cores. 

2.  Governing  equations 

We  consider  the  fully  compressible,  nonhydrostatic  Euler  equations  in  both  conservative  and  non-conservative  form. 
These  equations  (sets  2NC  and  2C),  which  are  valid  for  spatial  resolutions  finer  than  10  km,  have  previously  been  considered 
in  [15]  for  2D  limited-area  atmospheric  flows.  Set  2NC  is  considered  within  a  continuous  Galerkin  (CG)  framework,  whereas 
set  2C  is  considered  within  a  discontinuous  Galerkin  (DG)  framework. 

2.1.  Non-conservative  form  (set  2NC) 

Of  the  five  equation  sets  considered  in  [16],  set  2NC  proved  to  be  both  computationally  efficient  and  provides  acceptable 
mass  and  energy  conservation  properties;  in  addition,  replacing  the  advection  operator  in  the  momentum  equation  allows 
for  formal  conservation  of  energy  up  to  time  truncation  error.  Both  diabatic  forcing  and  the  effects  of  moisture  are  neglected; 
in  other  words,  we  consider  a  dry  dynamical  core  without  sub-grid  scale  turbulence  closure.  In  the  present  study,  we  con¬ 
sider  three-dimensional  flow  in  Cartesian  coordinates  (x-y-z)  subject  to  gravitational  and  Coriolis  forces,  yielding 


^+V-(pu)  =  0 
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—  +  u  ■  Vu  +  -  VP  +  gk  +  f  x  u  =  0 
dt  p 


80 

dt 


+  u  ■  V0=  0 


(lb) 


(lc) 


with  the  prognostic  variables  defined  as  (p,  uT,  0),  where  p  is  density,  u  =  (u,  v ,  w)T  is  velocity,  and  6  is  potential  temperature; 
the  superscript  T  denotes  the  transpose  operator.  In  addition,  P  is  pressure,  g  is  the  gravitational  constant,  f  =  2Qk  is  the 
Coriolis  parameter  (with  Q  the  angular  frequency  of  the  earth),  and  k  is  the  unit  vector  in  the  z  direction.  Eq.  ( 1  a)  enforces  mass 
conservation,  Eq.  (lb)  enforces  conservation  of  momentum,  and  Eq.  (lc)  enforces  conservation  of  entropy.  To  close  the  system 
of  conservation  laws  given  by  Eq.  (1 ),  a  thermodynamic  equation  of  state  is  required.  We  use  the  ideal  gas  law  given  by 


P  =  Pa 


m 


(2) 


where  PA  is  the  atmospheric  pressure  at  ground,  R  =  cp  -  cv  is  the  ideal  gas  constant,  and  y  =  ss  1 .4  is  the  ratio  of  specific 
heats.  In  three  dimensions,  Eqs.  (1)  and  (2)  constitute  a  closed  system  of  nonlinear  partial  differential  equations. 

To  facilitate  the  solution  of  the  compressible  Euler  equations  and  maintain  numerical  stability,  we  split  the  density,  pres¬ 
sure,  and  potential  temperature  about  their  mean  hydrostatic  values: 


p(x,z,y,t)  =  p0(z)  +  p'(x,y,z,t) 
0(x,y,z,t)  =  0o(z)  +  9'(x,y,z,  t) 
P(x, y,z,  t)  =  P0(z)  +  P'(x,y,z,  t) 


(3a) 

(3b) 

(3c) 


where  p0,  60,  and  P0  are  the  hydrostatic  reference  states.  For  all  the  test  cases  considered  in  this  paper,  the  reference  states 
are  functions  of  the  vertical  coordinate  z;  however,  more  sophisticated  test  cases  require  reference  states  that  are  functions 
of  x,  y,  and  z.  Inserting  Eq.  (3)  into  Eq.  (1)  and  applying  the  hydrostatic  balance 

dP0 

Hz=~p°g 
yields  the  system 

dp ' 


(4) 


+  u  ■  Vp'  +  u  ■  Vp0  +  (p1  +  p0)\7  ■  u  =  0 


dt 

du  _  1 

—  +  U-  Vu  +  — - 

dt  p '  +  p0 


VP'  + 


P' 


P'  +  Po 


gk  +  f  x  u  =  0 


r)0' 

— +  u- V0'  +  u- V0O  =  0. 
dt 

Defining  a  solution  vector  q  =  (p',ur,  0')T,  allows  us  to  write  Eq.  (5)  in  the  condensed  form 

|MNC(q> 

where  S2NC(q)  is  a  nonlinear,  first-order  differential  operator. 

2.2.  Conservative  form  (set  2C) 

Eq.  (1)  may  be  written  in  flux,  or  conservative  form 
dp 


(5a) 

(5b) 

(5c) 

(6) 


dt 

dU 

d& 

ht 


+  vu  =  o 
+  V.'U0U 


■  V- 


0U 

p 


+  PI3  +  pgk  +  f  X  U  =  0 


=  0 


(7a) 

(7b) 

(7c) 


complemented  by  the  equation  of  state 
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P  Pa  Pa 
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with  the  prognostic  variables  defined  as  (p,  UT,  ©)T,  where  p  is  density,  U  =  pu  is  momentum,  u  =  (u,  v,wjT  is  velocity,  and 
&  =  p6  is  density  potential  temperature.  In  addition,  the  operator  ®  denotes  the  tensor  product  and  I3  is  the  rank-3  identity 
matrix.  The  variables  in  these  equations  are  split  in  a  similar  fashion  to  Eq.  (1)  except  that  we  now  split  the  density  potential 
temperature  as  follows 


®(x,  y ,  z,  t)  =  @o  (z)  +  &{x,  y ,  z,  t) 

Applying  the  hydrostatic  decomposition  to  Eq.  (7)  yields 

dp1 


dt 


+  V  U  =  0 


(9) 


(10a) 


<9U  /U®U  ,  \  ,  ,  r 

-gjr  +  V  ■  - h  P I3 )  +  p  g\a  +  f  X  U  =  0 


0 

which  may  be  written  in  vector  form  as 

<9q 


dt 


+  V  ■  F(q)  =  S(q) 


(10b) 

(10c) 

(11) 


where  F(q)  is  the  flux  tensor  and  the  gravitational  and  Coriolis  terms  are  incorporated  into  S(q).  Finally,  we  can  write  this 
equation  in  the  condensed  form 


(12) 


3.  Numerical  methods 

In  this  section,  we  briefly  discuss  the  numerical  discretization  used  by  NUMA  which  includes:  the  spatial  discretization  of 
Eq.  (1)  using  a  continuous  Galerkin  (CG)  method  and  Eq.  (7)  using  a  discontinuous  Galerkin  (DG)  method  (Section  3.1);  the 
explicit  time-integrator  used  to  evolve  the  solution  forward  in  time  (Section  3.2);  the  necessary  boundary  conditions 
required  for  limited-area  problems  (Section  3.3);  and  the  artificial  diffusion  used  to  control  overshoots  (Section  3.4). 

3.1.  Spatial  discretization 

Let  us  now  describe  the  approximation  of  the  continuous  spatial  operators  using  both  CG  and  DG.  We  begin  with  a 
description  of  the  decomposition  of  the  global  spatial  domain  into  elements,  the  basis  functions  defined  within  these 
elements,  and  element-wise  integrals  that  are  common  to  both  CG  and  DG. 

3.1.1.  Basis  functions,  elements,  and  integrals 

Element-based  Galerkin  methods  such  as  the  continuous  Galerkin  and  discontinuous  Galerkin  methods  require  the 
decomposition  of  the  global  domain  Q  c  Tl3  into  Ne  non-overlapping  elements  Qe  via 


N„ 

(13) 

e=l 

In  the  current  formulation  of  NUMA,  we  let  Qe  be  hexahedra,  which  provides  simple  grid  generation  and  efficient  (fast)  eval¬ 
uation  of  the  necessary  differentiation  and  integration  operators.  We  note,  however,  that  Qe  may  be  replaced  by  tetrahedral, 
pyramidal  elements,  or  triangular  prisms;  while  this  can  be  done  for  both  CG  and  DG,  we  envision  doing  this  only  for  the  DG 
version  since  DG  does  not  require  global  matrices.  Assuming  a  purely  explicit  scheme,  CG  would  require  the  inversion  of  a 
sparse  global  mass  matrix  which  becomes  prohibitive  in  three-dimensions  for  very  large  problems  (this  is  the  case  because 
no  collocated  set  of  interpolation  and  integration  points,  that  results  in  a  diagonal  mass  matrix,  exists  for  triangular  prisms  or 
tetrahedra).  We  also  note  that  tetrahedral  and  pyramidal  elements  would  produce  an  unstructured  grid  in  the  vertical,  which 
would  impede  the  incorporation  of  column-based  microphysics  and  physical  parameterizations.  Finally,  the  tensor  product 
structure  may  be  exploited  with  hexahedral  elements,  reducing  the  complexity  of  a  3D  method  based  on  N-th  order  poly¬ 
nomials  from  C>(N6)  to  0(N4),  thereby  increasing  the  efficiency  of  EBG  methods  by  one  to  three  orders  of  magnitude  for  typ¬ 
ical  values  of  N  e  [3, 10]. 

Letting  the  unit  cube  (£,  p,Q  e  E  =  [-1 . 1]3  be  the  reference  hexahedral  element,  a  transformation  Te :  Qe  — >  E  mapping 
physical  space  to  computational  space  is  defined  for  each  element,  yielding  (x,y,z)  =  £)  where  the  associated 

Jacobian  of  Te  is  denoted  by  ]e. 
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Within  each  element  Qe,  a  finite-dimensional  approximation  qN  is  formed  by  expanding  q(x,  t)  in  basis  functions  ^-(x) 
such  that 

mn 

QnW)  =  ^iAj(x)qf(t)  (14) 

j=i 

where  MN  =  (N+  l)3  is  the  number  of  nodes  per  element,  N  is  the  order  of  the  basis  functions,  and  the  superscript  (e)  de¬ 
notes  element-wise  (local)  values.  For  basis  functions,  we  construct  tensor  products  of  Lagrange  polynomials  given  by 

•Ai(x)  =  K(0  ®  hp(ri)  ®  hy(Q  (15) 

where  hx(£)  is  the  Lagrange  polynomial  associated  with  the  Legendre-Gauss-Lobatto  (LGL)  points  {,■  and  (£,;/,£)  are  functions 
of  the  physical  variable  x.  These  LGL  points  satisfy 

(l  —  £2)£n(£)  =  o  (16) 

where  PN({)  is  the  N-th  order  Legendre  polynomial.  Hence,  we  are  utilizing  nodal  basis  functions,  as  opposed  to  modal  basis 
functions  (e.g.  Legendre  polynomials).  We  can  now  use  all  of  this  machinery  defined  to  construct  discrete  approximations  of 
the  continuous  spatial  derivatives.  For  example,  the  derivative  of  q  can  be  obtained  by  differentiating  equation  (14)  as 
follows 


mn 

Vq«(x,t)  =  ^V,AJ.(x)qf(t)  (17) 

j=i 

which,  after  multiplying  by  the  test  (basis)  function  i p  and  integrating  within  an  element  yields 

p  mn 

/  *f£v^(x)q<e>(t)dfle  (18) 

j=1 

where  we  shall  use  Nth-degree  LGL  integration  (numerical  quadrature)  points  for  approximating  the  integral.  We  note  that 
aliasing  errors  arise  due  to  an  insufficient  number  of  integration  points  in  Eq.  (18).  To  remove  these  aliasing  errors,  a  modal 
low-pass  filter  is  used  [2]. 


3.1.2.  Continuous  Galerkin 

For  CG  we  shall  use  Eq.  (1)  as  our  governing  equations.  Using  Eq.  (14)  to  approximate  qN,  multiplying  by  a  test  function 
and  integrating  as  in  Eq.  (18)  yields  the  element-wise  definition  of  the  problem 


and  applying  the  global  assembly,  or  direct  stiffness  summation  (DSS)  operator  required  by  all  CG  methods  yields  the  weak 


formulation  for  CG:  find  qN  e  such  that 

Jo  ^i^fdQ  =  t/'fS2NC(qN)df2  €  V“ 

(20) 

where 

VGG={il7€H\QMePN(l),  e  =  1, . . .  ,Ne| 

(21) 

and  PN  denotes  the  space  of  all  polynomials  of  degree  N  in  the  interval  /  =  [-1,1].  Notice  that  the  requirement  t//  e  H1  (Q) 
implies  V^G  c  C°(Q).  The  DSS  operator  /\eli  forms  global  matrices  from  the  element  (local)  matrices.  For  example,  for  the 
local  mass  matrix  defined  as 


M'f  =  [  </#, dQe  (22) 

JQe 

the  DSS  operator  produces  the  global  mass  matrix 

M,j  =  J\Mf  =  /\  f  UjdQe  (23) 

6=1  6=1  ^ 

where  the  DSS  sums  the  contribution  of  all  the  elements  e=  1  ,...,Ne  and  all  interpolation  points  within  the  elements 
i  =  1 , . . . ,  Mn  and  stores  them  in  the  global  grid  points  /  =  1 , . . . ,  NP  as  follows  (i,  e)  — >  I.  Extracting  the  global  mass  matrix 
from  Eq.  (19)  and  writing  the  operator  S  as  a  vector,  results  in  the  following  compact  matrix-vector  form 


(24) 
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where  we  now  use  the  total  derivative  with  respect  to  time  to  emphasize  that  q,  =  q,(t)  represents  the  expansion  coefficients 
that  are  only  functions  of  time.  In  addition,  note  that  the  mass  matrix  My  =  fB  \l/,\l/jdQ  is  diagonal  because  the  interpolation 
and  integration  points  have  been  deliberately  chosen  to  be  co-located;  this  approach  incurs  negligible  error  for  N  js  4  [14]. 
Denoting  the  right-hand  side  (RHS)  of  Eq.  (24)  by  R;(q),  Eq.  (24)  is  expressed  as 


f =A«r(<f) 


(25) 


Note  that  the  DSS  operator  maps  local,  element-wise  coordinates  (i,  e)  to  global  coordinates  J.  Eq.  (25)  forms  the  core  of  the 
spectral  element  method,  allowing  local,  element-wise  information  q|e)  to  propagate  to  adjacent  elements  via  the  DSS 
operator. 


3.1.3.  Discontinuous  Galerkin 

For  DG  we  use  Eq.  (7)  because  the  DG  method  requires  the  equations  to  be  in  conservation  form.  We  have  also  developed 
a  CG  code  with  set  2C  but  we  use  set  2NC  for  CG  because  it  represents  the  optimal  equation  set  for  our  needs  as  described  in 
[16]. 

Beginning  with  Eq.  (11),  using  the  basis  function  expansion,  Eq.  (14),  multiplying  by  a  test  function,  and  integrating 
element-wise  yields 


dt 


d£2(*  “j- 


^S(e)(qN)dC2e 


(26) 


Integrating  the  divergence  of  the  flux  by  parts  yields  the  weak  formulation  for  DG:  find  q|f  e 


Ij-f 


d£2(>  + 


'v/aces  p 

E  l  *<« 


t^-F^q N)dre-  (  Vi//,  ■  F(e\qN)dQe  =  f  *,S<e>(q  N)dQe  ^  eV°c 

J  Qp  J  Qp 


(27) 


where  the  DG  finite-dimensional  space  is  defined  as 

VDNG={^eL2(QMePN(I),  e  =  1, . . .  ,Ne  j  (28) 

Notice  that,  contrary  to  Eq.  (21),  there  is  no  global  continuity  requirement,  so  that  V°G<£  C°(Q).  This  is  possible  because  in 
Eq.  (27)  differentiability  is  required  separately  within  each  element  Qe,  and  not  within  the  entire  domain  Q.  The  coupling 
between  neighboring  elements  is  then  recovered  through  the  numerical  flux  (or  Riemann  solver)  F(e’k),  which  is  required 
to  be  a  single  valued  function  on  the  inter-element  boundaries. 

We  use  the  Rusanov  flux  defined  as 


F(e’fc)  =  \  [F(e)  +  F(k)  -  n(e’k)cm4q«  -  q'e))]  (29) 

where  n(el[)  is  the  unit  normal  vector  pointing  from  the  element  e  to  the  neighboring  element  k  and  cmax  is  the  maximum 
wave  speed  of  the  Euler  equations  (i.e.,  maximum  eigenvalue  of  the  Jacobian  matrix)  which  turns  out  to  be 
cmax  =  |n  •  u|  +  a  where  a  denotes  the  speed  of  sound.  Note  that  other  types  of  numerical  fluxes  are  possible  including  mul¬ 
ti-dimensional  Riemann  solvers  (e.g.  [10]).  We  have  initially  chosen  this  simple  flux  function  because  it  vastly  simplifies  the 
parallelization  strategy.  Using  a  one-dimensional  flux,  as  we  have,  reduces  the  element  communication  to  edge  neighbors 
but  restricts  the  maximum  allowable  time-step.  Conversely,  using  a  multi-dimensional  Riemann  solver  will  increase  the 
maximum  allowable  time-step  but  will  also  increase  the  size  of  the  communication  stencil.  The  vices  and  virtues  of  this 
approach  need  to  be  studied  in  detail  and  we  reserve  this  for  future  work. 


3.2.  Explicit  RK  methods 


We  use  the  strong  stability  preserving  (SSP)  Runge-Kutta  third-order,  five  stage  time-integrator  (RK35)  proposed  in  [38]. 
This  time-integrator  is  stable  for  (acoustic)  Courant  numbers  of  1.3  or  less  when  using  a  continuous  Galerkin  discretization, 
whereas  for  DG,  the  time-integrator  is  stable  for  (acoustic)  Courant  numbers  of  0.85  or  less. 

We  emphasize  that  this  explicit  RI<  method  will  not  be  used  in  an  operational  setting  due  to  the  stringent  CFL  require¬ 
ment;  rather,  we  are  developing  IMEX  methods  which  allow  much  larger  time-steps.  Nevertheless,  this  particular  time  inte¬ 
grator  was  chosen  for  this  study  for  the  following  reasons:  (1)  to  provide  high-order  accuracy  in  time  to  complement  the 
high-order  spatial  discretization,  (2)  to  minimize  the  amount  of  numerical  dissipation,  (3)  to  allow  a  relatively  larger 
time-step  (with  respect  to  other  explicit  methods),  and  (4)  to  provide  stable  solutions  for  both  CG  and  DG  methods.  Criteria 
(1)  and  (2)  are  necessary  for  resolving  fine-scale  flow  features  present  in  many  atmospheric  phenomena,  whereas  criterion 
(3)  is  necessary  for  model  efficiency,  and  criterion  (4)  is  required  in  order  to  facilitate  the  comparison  of  CG  and  DG.  In  a 
previous  study  involving  nonhydrostatic  modeling  [6],  all  available  third-order  multi-stage  methods  in  addition  to  an  assort¬ 
ment  of  multi-step  methods  were  compared  and  RK35  was  found  to  be  the  most  efficient.  In  addition,  RK35,  unlike  leapfrog, 
does  not  have  a  computational  mode  and  therefore  does  not  require  a  time  filter  (DG  also  would  not  work  with  leapfrog  due 
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to  the  eigenvalues  for  DG  not  residing  along  the  imaginary  axis  as  they  do  for  CG).  Note,  however,  that  within  a  parallel  set¬ 
ting,  each  RK  stage  requires  parallel  communication,  thus  making  this  choice  of  time-integrator  potentially  expensive.  How¬ 
ever,  this  extra  expense  is  tolerated  in  order  to  maintain  high-order  accuracy  in  time.  It  should  also  be  mentioned  that  we 
have  chosen  an  explicit  method  in  order  to  focus  the  discussion  of  CG  and  DG  with  respect  to  scalability.  The  optimal 
scalability  will  be  achieved  by  an  explicit  method.  A  study  on  the  scalability  of  IMEX  methods  requires  a  discussion  of  a  num¬ 
ber  of  topics  including:  implicit  time-integrators,  the  choice  of  iterative  solver,  and  specific  preconditioning  strategies;  all  of 
which  we  reserve  for  future  work. 


3.3.  Boundary  conditions 


Limited-area  atmospheric  models,  such  as  mesoscale  codes,  typically  require  two  types  of  boundary  conditions:  no-flux 
boundary  conditions  (NFBCs),  that  represent  impenetrable  objects  (e.g.,  the  ground)  and  non-reflecting  boundary  conditions 
(NRBCs),  that  represent  an  infinite  domain  (e.g.  the  top  of  the  atmosphere)  by  allowing  waves  to  propagate  out  of  the  com¬ 
putational  domain  without  generating  spurious  reflections.  In  this  section,  we  outline  the  implementation  of  NFBCs  and 
NRBCs  for  both  CG  and  DG.  For  additional  details,  see  [15,16]. 


3.3.2.  No-flux  boundary  conditions 

All  of  our  test  cases  use  NFBCs  on  the  bottom  boundary,  while  some  test  cases  (such  as  the  rising  thermal  bubble)  use 
NFBCs  on  other  boundaries  as  well.  For  CG,  we  enforce 


n  u  =  0  (30) 

for  all  points  on  the  boundary  F,  where  n  is  the  outward  pointing  unit  normal  vector  on  r  defined  as  n  =  (nx,ny,nz)T .  In 
order  to  apply  the  NFBC  to  the  prognostic  vector  q,  we  augment  n  to  R5  via  ii  =  (0,  nT,  0)T,  yielding  n  q  =  0.  To  apply  these 
boundary  conditions  in  the  strong  sense,  we  construct  a  3  by  3  orthogonal  projector  P  defined  as  follows: 


Z1  ~nl 

-nxny 

-nxn 

-nynx  1  -  n2y 

-nyn 

\  - nznx 

-nzny 

1  -  n 

(31) 


This  matrix  is  constructed  during  the  initialization  phase  and  applied  to  the  RHS  operator  after  each  time-step. 

On  the  other  hand,  for  DG  the  NFBCs  are  imposed  via  the  numerical  flux  as  follows.  On  boundary  edges  where  NFBCs  are 
to  be  imposed,  a  ghost-cell  is  constructed  on  the  other  side  of  the  wall  that  has  a  velocity  vector  the  negative  of  the  interior 
element.  In  other  words, 


U«  =  _u«0 


where  k  denotes  the  ghost-cell  positioned  on  the  other  side  of  a  no-flux  boundary;  note  that  this  ensures  that  the  no-flux 
condition  n  U  is  satisfied.  The  scalar  variables  are  copied  directly  such  that  p(k)  =  p(e\  etc. 


3.3.2.  Non-reflecting  boundary  conditions 

In  an  operational  mesoscale  NWP  model,  the  four  lateral  and  the  top  boundaries  should  properly  represent  an  open  do¬ 
main.  That  is,  waves  should  smoothly  exit  the  domain  without  reflection;  in  addition,  information  from  outside  the  domain 
should  be  allowed  to  enter  the  domain  of  interest.  Mathematically  modeling  this  behavior  is  non-trivial  and  has  attracted  the 
attention  of  researchers  in  many  disciplines  [5,19,28].  In  our  model,  we  use  a  simple,  albeit,  effective  absorbing  sponge  layer 
method.  The  computational  domain  is  surrounded  by  a  layer  with  Newtonian  relaxation  coefficients  a(x)  and  p(x)  such  that 
a.  =  1  and  p  =  0  in  the  domain  of  interest,  while  a  — >  0  and  p  — >  1  as  the  boundary  is  approached.  Specifically,  for  the  top 
boundary,  we  choose 


/»  = 


z-zs 

Zt-zs 


4 


(32) 


where  zt  is  the  vertical  height  of  the  domain  and  zs  is  the  bottom  of  the  sponge  layer.  Similar  functions  are  used  for  the  lateral 
boundaries  of  the  domain.  Once  the  sponge  layer  is  constructed,  the  numerical  solution  q  given  by  the  RHS  operator  is  re¬ 
laxed  to  some  known  solution  at  the  boundary  qb  via 

q  =  a(x)q  +  P{x)qb  (33) 


For  problems  under  consideration  in  this  paper,  q6  is  a  far-held  condition  (e.g.  known  wind  velocity  and  zero  perturbation  of 
scalars). 


3.4.  Artificial  diffusion 


To  add  a  certain  level  of  numerical  dissipation  we  have  implemented  second  order  artificial  diffusion  operators,  although 
hyper-viscosity  is  also  possible.  For  Eq.  (6)  we  add  the  following  right-hand-side  operator 


0 

vV  •  (Vu) 
vV  ■  (V0') 
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and  for  Eq.  (12)  we  add  the  operator 

f  W  •  (pVu)  ] 

\  vV  •  (pV0')  / 

Note  that  the  artificial  diffusion  operator  applied  to  the  potential  temperature  equation  must  act  only  on  the  potential  tem¬ 
perature  perturbation,  8',  since  applying  it  to  the  full  potential  temperature  variable  would  smoothen  the  background  refer¬ 
ence  field  and  destroy  the  hydrostatic  balance.  For  CG  the  implementation  of  the  diffusion  operator  follows  the  usual 
strategy  through  integration  by  parts.  For  DG,  we  use  the  local  discontinuous  Galerkin  method  as  described  in  [15]. 

4.  Parallel  implementations  of  EBG  methods 

NUMA  is  an  MPI-based  code  targeted  toward  distributed  memory  architectures  (e.g.  clusters).  In  this  section,  we  discuss 
the  parallel  implementation  of  NUMA,  including:  a  description  of  the  domain-decomposition  (Section  4.1),  communication 
strategies  for  both  CG  and  DG  (Sections  4.2  and  4.3),  a  comparison  of  these  communication  volumes  (Section  4.4  discussion 
of  the  memory  access  of  these  EBG  methods  (Section  4.5). 

4.1.  Domain  decomposition 

We  decompose  Q  into  Np  processor  elements  (PE)  Qp  that  consist  of  local  elements  £2®.  Mathematically,  we  rewrite  Eq. 
(13)  as 

Np  Nf 

fl  =  UU^p)  (34) 

p=le'=l 

where  Nf]  is  the  number  of  local  elements  residing  on  PE  p.  Since  the  DSS  operator  for  CG  requires  global  elements  e  and  the 
flux  integrals  for  DG  require  global  faces/,  we  must  also  construct  local  to  global  mappings  for  both  elements  e  =  LGE[p\e') 
and  faces  /  =  LGF(P)(/')  that  map  local  elements  e'  and  faces/'  on  processor  p  to  global  elements  e  and  faces /residing  on  the 
global  domain  Q.  For  CG,  the  inverse  global-local  mapping  is  also  required,  which  may  be  constructed  via  a  hash  table  or 
implicitly  using  a  binary  search. 

A  guiding  principle  in  the  construction  of  NUMA  is  to  maintain  independence  between  grid  generation  and  the  CG/DG 
solver;  hence,  domain  decomposition  should  be  as  general  as  possible  and  not  be  constrained  by  the  choice  of  grid.  There¬ 
fore,  we  have  implemented  a  domain  decomposition  strategy  based  on  the  widely  used  METIS  graph  partitioning  library 
[22],  METIS  requires  an  adjacency  graph  where  the  Ne  vertices  are  elements  Qe  and  the  “edges”  denote  the  connectivity  be¬ 
tween  elements.  Since  the  DSS  operator  requires  information  from  elements  that  share  nodes,  the  “edges”  include  all  forms 
of  geometric  connectivity  (faces,  edges,  and  vertices).  For  maximum  flexibility,  we  construct  a  weighted  adjacency  graph 
G'  =  (V,  E)  with  adjacency  matrix  A'  of  size  Ne  by  Ne  defined  as 

{v  if  i  and  j  are  vertex  neighbors 

e  if  i  and  j  are  edge  neighbors  (35) 

1  if  i  and  j  are  faces  neighbors 

where  v  and  e  are  arbitrary  weights.  Although  not  explored  in  this  paper,  the  values  of  v  and  e  may  be  chosen  to  construct  a 
machine-optimal  weighted  adjacency  matrix.  Once  A'  is  constructed  the  adjacency  matrix  A  for  the  graph  G  =  (V,F),  where  F 
are  geometrical  faces,  is  simply  a,j  =  1  if  aj.  =  1  and  ap  =  0  otherwise.  Two  example  connectivity  graphs  for  a  2D  grid  are 
shown  in  Fig.  1,  showing  both  edge  and  vertex  connectivity.  The  associated  9  by  9  adjacency  matrix  has  nodes  with  degree 
ranging  from  3  (for  the  corner  nodes)  to  8  (for  the  central  node).  In  contrast,  the  flux  integrals  only  require  face  adjacency 
information;  hence,  our  DG  code  only  uses  the  standard  adjacency  matrix  A.  Also  shown  in  Fig.  1  is  the  adjacency  matrix 
associated  with  DG,  which  only  shows  edge  (face)  adjacency.  These  adjacency  matrices,  along  with  the  number  of  processors 
Np  are  then  passed  to  METIS,  which  returns  a  partition  mapping  global  elements  to  processors.  This  mapping  is  then  used  to 
construct  the  local  to  global  mappings  LG  necessary  for  global  assembly  in  Eq.  (25),  as  well  as  processor-element  commu¬ 
nication  data  structures. 

4.2.  Parallelization:  CG 

We  first  consider  the  parallelization  of  CG.  The  global  DSS  operator  in  Eq.  (25)  requires  inter-processor  communication 
due  to  the  C°  requirement  at  processor  element  boundaries.  A  global  DSS  is  required  in  two  parts  of  the  code:  construction 
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(a)  EBG  Grid 


(b)  Generalized  Adjacency 
Graph  (CG) 


(c)  Adjacency 

(DG) 


Graph 


Fig.  1.  Example  2D  grid  (left),  the  CG  associated  adjacency  graph  (center),  and  DG  adjacency  graph  (right).  Since  CG  methods  use  nodal  communication,  the 
generalized  adjacency  graph  includes  both  edge  and  vertex  connectivity,  with  a  maximum  degree  of  eight.  For  3D,  structured,  hexahedral  grids,  the 
maximum  degree  is  26.  In  contrast,  typical  DG  methods  use  face-based  communication  and  hence  only  require  the  adjacency  graph  shown  in  the  right 
panel.  For  3D  hexahedral  grids  the  maximum  degree  for  DG  is  only  6. 

of  the  mass  matrix  and  construction  of  the  right-hand  side  (RHS).  To  perform  this  communication,  we  first  construct  the 
boundary  nodes  of  each  processor  (excluding  the  physical  boundary  where  BCs  are  applied).  We  denote  the  boundary  of 
processor  element  i  by  <9Q,.  In  order  to  communicate  between  processor  elements,  we  construct  two  data  structures  send 
and  recv.  Consider  a  particular  processor  i  and  neighboring  processor  j.  The  send  data  structure  contains  the  local  nodes 
on  processor  i  that  are  sent  to  each  neighboring  processor  j,  while  recv  contains  the  local  nodes  on  processor  j  that  must 
be  sent  to  i.  In  order  to  construct  send,  we  use  the  method  shown  in  Algorithm  1 .  The  corresponding  recv  structure  contains 
the  same  grid  points  as  send,  although  they  may  be  ordered  differently. 


Algorithm  1:  Construction  of  MPIJSEND/MPLIRECV  communication  data  structures  for  both  CG  and  DG 
for  all  NBHs  j  of  i  do 
MPIJSEND  dQf  <-  LGE{i\dQi)  to  pro cj 
MP1JRECV  dQf  <-  LGE,j)  ( dQj )  to  proc  i 
B  dQf  n  dQf 

send(j)  <-  [lGE(i)]  (B) 

end  for 


Once  these  data  structures  have  been  constructed,  the  global  mass  matrix  and  RHS  operators  may  be  constructed  via  a 
two  step  process.  The  global  mass  matrix  Ml}  and  RHS  operator  in  Eq.  (25)  are  decomposed  as 

Ne  Np  Aif 

M//=AMr=AAMr and  (36a) 

e=l  np= le'=l 

Ne  Np  N(f> 

r,  =  Me)  =  A  A  Rlf]  (36b) 

e=l  np= le'=l 

Hence,  the  global  DSS  is  decomposed  into  a  local,  on-processor  DSS  and  a  global  DSS,  that  requires  inter-processor  commu¬ 
nication.  In  this  way,  global  continuity  between  elements  is  preserved  during  the  construction  of  each  RHS.  Note  that  the 
communication  stencil  is  simply  the  boundary  of  each  processor  element  dQ®  and  is  independent  of  the  polynomial  order 
N;  that  is,  unlike  high-order  finite  difference  or  finite  volume  methods,  the  spectral-element  method  is  halo-free.  We  note 
however,  that  the  message  size  grows  with  the  surface  area  of  the  processor  element  as  N1 2 3.  The  global  DSS  procedure 
may  be  summarized  as  follows: 


Algorithm  2:  The  direct  stiffness  summation  (DSS)  operation  in  CG 

1 .  Perform  a  local  DSS  on  processor  i. 

2.  Exchange  boundary  points  between  processors  i  and  all  processor  neighbors  j  using  (asynchronous)  isend  and  irecv 
MPI  commands. 

3.  Perform  a  global  DSS  using  the  boundary  data  received  from  neighbors). 
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(a)  CG  stencil 


(b)  DG  stencil 


Fig.  2.  Computational  stencils  for  (a)  CG  and  (b)  DG  in  a  2D  setting,  where  each  processor  element  owns  one  element.  In  the  left  panel  (CG),  the  element  E 
(green)  requires  nodal  information  from  its  8  nodal  neighbors  in  order  to  construct  the  DSS  operator  on  the  boundary  (red  dots).  In  the  right  panel  (DG),  the 
same  element  E  only  requires  nodal  information  from  its  4  face  neighbors  in  order  to  construct  flux  integrals.  In  both  methods,  the  interior  nodes  (yellow 
dots)  do  not  need  to  be  communicated  to  adjacent  processors,  resulting  in  a  halo-free  algorithm.  (For  interpretation  of  the  references  to  colour  in  this  figure 
legend,  the  reader  is  referred  to  the  web  version  of  this  article.) 


The  global  DSS  operation  is  represented  schematically  in  the  left  panel  of  Fig.  2  in  a  2D  setting.  To  simplify  the  discussion, 
each  processor  is  assumed  to  own  one  element.  In  order  to  construct  the  RHS  operator  on  the  boundary  (red  dots),  the 
element  E  (green)  requires  nodal  information  for  its  8  nodal  neighbors.  These  neighbors  include  both  edge  neighbors  (2, 
4,  6,  and  8)  and  vertex  neighbors  (1,  3,  5,  and  7).  In  a  3D  setting,  a  hexahedral  element  may  have  (a  maximum)  of  6  face 
neighbors,  12  edge  neighbors,  and  8  vertex  neighbors,  for  a  total  of  26  nodal  neighbors.  Note  that  for  tetrahedral  grids, 
the  number  of  vertex  neighbors  may  be  much  greater.  To  reduce  this  communication  cost,  METIS  is  used  to  reduce  the  total 
number  of  vertex  neighbors;  in  a  typical  3D  partition,  a  processor  element  lying  away  from  a  physical  boundary  has  16  or 
fewer  nodal  neighbors;  hence,  the  METIS-based  decomposition  further  reduces  communication  cost  relative  to  a  naive 
geometric  domain  decomposition.  In  particular,  we  use  the  routine  METls_PartGraphRecursive,  which  reduces  the 
edge-cut  of  the  graph  G. 

EBG  methods  possess  purely  local  communication  stencils.  Referring  to  the  left  panel  of  Fig.  2,  the  interior  nodes  (yellow 
dots)  do  not  need  to  be  communicated  to  adjacent  processors,  resulting  in  a  halo- free  algorithm.  Thus  unlike  high-order  finite 
difference  and  finite  volume  schemes,  spectral  elements  may  achieve  spectral  accuracy  without  sacrificing  their  local  character. 

4.3.  Parallelization:  DG 

Parallelization  of  DG  is  much  more  straightforward  than  CG.  Since  there  are  no  global  mass  matrices,  all  that  is  required  is 
a  way  to  communicate  boundary  data  between  processor  elements.  In  addition  to  a  local/global  element-mapping,  an  addi¬ 
tional  local/global  face  structure  is  also  required.  Faces  on  each  processor  element  (excluding  boundary  faces)  are  classified 
into  two  types:  intra-processor  and  inter-processor.  The  necessary  send  and  receive  data  structures  are  then  constructed  by 
calculating  the  intersection  of  face  boundaries  on  each  PE  as  described  in  a  manner  similar  to  Algorithm  1.  Unlike  CG,  how¬ 
ever,  the  DG  algorithm  requires  only  the  set  of  faces  /  which  need  to  be  sent  to  adjacent  processors  and  not  the  actual  grid 
points.  Once  these  maps  and  data  structures  are  constructed,  the  flux  operators  are  evaluated  using  the  following  algorithm, 
which  employs  interleaved  computation/communication: 


Algorithm  3:  Interleaved  computation  of  the  flux  operator  in  DG 

1.  Send/Receive  boundary  data  via  non-blocking  MPI  Send/Receive.  While  waiting  for  boundary  data  do: 

2.  Compute  volume  integrals. 

3.  Compute  intra-processor  flux  integrals. 

4.  MPI_Wai t  () 

5.  Compute  inter-processor  flux  integrals. 

6.  Multiply  by  element-wise  inverse  mass  matrix. 


The  communication  stencil  is  illustrated  geometrically  in  the  right  panel  of  Fig.  2.  Like  CG,  DG  algorithms  do  not  require 
internal  nodes  to  be  communicated  to  adjacent  processors  and  is  hence  halo-less.  Moreover,  since  all  communication  is  based 
on  fluxes,  only  elements  which  share  adjacent  faces  need  to  communicate.  This  compact  stencil  has  a  crucial  impact  on  both 
the  communication  volume  and  memory  access  patterns  of  DG  when  compared  to  CG.  These  issues  are  explored  in  the  next 
two  sections. 
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We  note  that  the  focus  of  this  study  is  a  comparison  of  CG  and  DG  from  an  algorithmic  point  of  view  and  that  the  par¬ 
allelization  algorithms  outlined  in  this  and  the  previous  section  are  not  necessarily  optimal.  In  particular,  we  have  not  imple¬ 
mented  sophisticated  scheduling  between  the  individual  MPI  processes  using  MPl_Waitany  or  Mpi_Waitsome,  nor  have  we 
exploited  topology  aware  communication.  We  are  currently  addressing  these  implementation  details  in  our  codes.  Therefore, 
it  must  be  understood  that  both  methods  can  indeed  be  further  optimized;  however,  the  results  we  present  represent  the 
simplest  approaches  for  constructing  parallelization  strategies. 


4.4.  Communication  volume 


A  simple  model  [9]  for  the  communication  time  (or  cost)  of  a  single  MPI  process  is 

tc  =  ta(a  +  pm)  (37) 

where  ta  is  the  time  to  do  a  floating-point  operation,  a  is  proportional  to  the  latency  (i.e.,  message  startup  cost),  and  p  is  the 
asymptotic  transfer  rate  (i.e.,  inverse  bandwidth).  Typically,  the  latency  cost  a  is  larger  than  bandwidth  cost  [I  by  one  to  three 
orders  of  magnitude,  depending  on  the  architecture  of  the  machine.  To  characterize  the  trade-off  between  bandwidth  and 
latency,  the  ratio  p/a  provides  a  useful  metric;  a  typical  value  is  p/a  =  0.01.  Using  the  model  given  by  Eq.  (37)  the  commu¬ 
nication  overhead  of  both  DG  and  CG  may  be  quantitatively  analyzed. 

Consider  a  hexahedral  processor  element  (PE)  with  Ne  elements  discretized  with  order  N  polynomials.  Geometrically,  the 
lengths  of  the  face,  edge,  and  vertex  messages  are  approximated  as 

mf  =  nlJarN2e,3(N  +  l)2  (38a) 

me  =  rwN’/3(N+l)  (38b) 


mv  =  nvar  (38c) 

where  nvar  =  5  is  the  number  of  state  variables,  N2J3  approximates  the  number  of  faces  while  N]/3  the  number  of  edges. 

For  DG,  since  the  inter-processor  fluxes  only  require  face  communication,  and  since  each  PE  has  an  average  of  six  face 
neighbors,  the  total  communication  cost  at  each  stage  of  the  RI<  time-integrator  is 

CDG  =  taa(6  +  6^nmrN2e/3(N+  l)2^).  (39) 

Asymptotically,  we  see  that  the  communication  cost  scales  as  C>(n2/3N2^.  In  contrast,  the  computation  volume,  which  is 
dominated  by  the  construction  of  volume  integrals,  scales  as  o(NeN4^J ;  hence,  for  large  N  and/or  large  Ne,  the  ratio  of  com¬ 
munication  to  computation  o(n/1/3N  2^)  tends  to  zero,  thereby  ensuring  weak  scaling.  In  other  words,  the  total  wallclock 

time  remains  constant  if  the  number  of  PEs  grows  at  the  same  rate  as  the  number  of  elements. 

In  contrast,  for  CG  the  construction  of  the  DSS  operator  requires  edge  and  vertex  communications  in  addition  to  face  com¬ 
munication.  Since  for  a  hexahedral  PE,  there  are  12  edges  and  8  vertices,  the  communication  cost  is 


Ccc  =  6tfl(a  +  pmf)  +  12  ta(a  +  pme)  +  8  ta{a  +  pmv) 

=  taa  ( 26  +  6  nvarN2J3  (N  +  1  )2  +  8  -  nvarN]/3  (N+  1)  +  8-n„ 


(40) 


As  with  DG,  the  communication  cost  scales  as  o(^N2/3N2)  with  a  similar  estimate  for  computation;  hence,  in  an  asymptotic 
sense,  both  CG  and  DG  have  the  same  communication  footprints.  Geometrically,  this  can  be  understood  since  the  surface 
area  of  the  faces  (required  by  both  CG  and  DG)  of  a  PE  grow  faster  than  the  edges  and  vertices  (that  are  required  only  by 
CG).  These  asymptotic  estimates  are  not  significant  for  practical  choices  of  N  and  Ne,  e.g.,  for  N  e  [4, 16]  and  small  values 
of  Ne  the  intermediate  behavior  of  Eqs.  (39)  and  (40)  become  quite  important.  We  note  briefly  that  the  communication  foot¬ 
print  of  CG  methods  may  be  reduced  to  a  footprint  similar  to  DG  using  the  d-directional  exchange  algorithm  proposed  in 
[11];  however,  implementation  of  this  algorithm  is  non-trivial  and  limited  to  tensor  product  meshes,  and  is  therefore  not 
explored  in  this  study. 

A  typical  ratio  of  bandwidth  to  latency  on  modern  computers  is  p/a  =  0.01.  Consider  the  case  of  one  element  per  PE, 
Ne  —  1 ,  and  N  =  4.  Then  CCg/Cdg  ss  2.6.  Hence,  CG  has  a  significant  communication  overhead  relative  to  DG  in  practice.  To 
help  explain  this  communication  overhead,  we  plot  the  ratio  of  Eqs.  (40)  and  (39)  in  Fig.  3  for  three  different  values  of 
p/a  and  a  range  of  polynomial  orders  N.  Several  trends  are  evident  from  this  graph.  Firstly,  for  all  bandwidth/latency  ratios, 
the  communication  overhead  is  greatest  for  linear  (N  =  1 )  polynomials  and  decreases  monotonically  to  one  as  JV  — ►  oo.  This 
behavior  is  expected  from  the  asymptotic  estimates  made  above.  However,  from  a  practical  point  of  view,  the  machine  archi¬ 
tecture  and  MPI  implementation,  which  is  characterized  by  the  ratio  p/a,  determine  the  relative  communication  footprint  of 
CG  to  DG.  As  the  latency  costs  increase  relative  to  bandwidth  costs  ( p/a  decreases),  CG  incurs  greater  communication  rela- 
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Fig.  3.  The  Ratio  of  communication  costs  CCc/Cdg  plotted  for  transfer  rate/latency  ratios  fi/tx  =  0.1,0.01  and  0.001  for  polynomial  orders  N  =  1  to  16. 

tive  to  DG.  Since  the  general  trend  in  HPC  is  toward  greater  bandwidth  (1  /yS  is  large  and  therefore  f)  is  small)  and  greater 
latency  (large  a),  we  expected  DG  to  significantly  outperform  CG  in  terms  of  communication  footprint  as  fJ/c/.  decreases. 

Note  that  for  both  CG  and  DG,  the  communication  volume  is  ofN2/3N2J  while  the  computation  volume  is  o(NeN4^. 
Hence,  for  large  number  of  elements  and  (relatively)  large  orders  N,  the  computation  volume  overwhelms  the  communica¬ 
tion  volume.  Therefore,  in  the  DG  algorithm  described  in  Algorithm  3,  steps  2  and  3  typically  require  much  more  time  than 
the  communication  in  step  1.  For  this  reason,  DG  is  expected  to  scale  to  massive  numbers  of  cores  on  distributed  memory 
architectures. 

4.5.  Memory  access 

Fetching  data  from  memory  poses  a  serious  bottleneck  in  any  HPC  application.  The  latency  associated  with  reading  data 
from  main  memory  can  adversely  affect  the  efficiency  and  scaling  of  any  HPC  application.  In  general,  EBG  methods  exhibit  a 
high  level  of  data  locality  since  the  computational  work  is  organized  into  discrete  elements.  In  this  section,  we  explore  the 
impact  of  data  layout  and  access  in  CG  and  DG  and  how  this  data  access  affects  performance. 

Fig.  4  shows  the  grid  numbering  used  by  (a)  CG  and  (b)  DG  for  Ne  =  4  and  N  =3.  Let  us  assume  for  the  moment  that  CG 
uses  a  global  indexing  while  DG  must  use  a  local  indexing.  This  data  layout  has  an  impact  on  the  amount  of  data  that  must  be 
fetched  from  memory  and  stored  in  fast  cache  for  an  EBG  method.  In  DG,  the  construction  of  volume  integrals  is  element- 
local.  Since  the  data  for  each  element  lies  in  a  contiguous  region  of  memory,  all  of  the  data  for  each  element  may  be  fetched 
from  slow  memory  loaded  into  fast  cache  at  the  beginning  of  an  element-wise  operation.  The  only  operation  that  requires 
access  to  a  neighboring  element’s  data  is  the  construction  of  flux  integrals.  Hence,  during  the  construction  of  a  RHS,  the  num¬ 
ber  of  cache  misses  is  small. 

For  CG,  the  state  vector  is  typically  stored  in  a  continuous  block  of  memory,  rather  than  in  an  element-wise  fashion.  Note 
that  it  is  possible  to  use  element-wise  storage  in  CG  using  an  additional  mapping  operator;  however,  in  this  analysis,  we 
discuss  only  the  standard  finite  element  (global)  numbering.  Obviously,  this  data  layout  uses  less  memory  than  the  ele¬ 
ment-wise  DG  layout;  however,  a  penalty  in  performance  may  be  incurred.  To  construct  a  RHS,  first  data  must  be  fetched 
from  the  global  state  vector  and  stored  in  a  local  array.  Cache  misses  may  occur  at  all  points  on  the  boundary  of  the  element, 
especially  for  large  grids,  where  there  is  a  large  stride  in  indexing.  Element-wise  operations  are  then  performed,  followed  by 
DSS,  which  requires  updating  the  global  state  vector,  thus  resulting  in  additional  cache  misses.  These  cache  misses  are  exac¬ 
erbated  in  3D,  where  a  typical  element  requires  data  from  26  adjacent  elements.  Although  these  cache  misses  may  be  par¬ 
tially  alleviated  by  grid  renumbering,  a  CG  method  can  never  achieve  the  natural  data  locality  of  a  DG  method.  Even  if  CG 
uses  DG  storage,  the  cache  misses  associated  with  the  DSS  operator  cannot  be  circumvented;  the  DSS  operator  requires  indi¬ 
rect  memory  access  that  requires  striding  through  the  memory  in  order  to  enforce  C°  continuity. 

For  optimal  performance,  the  entire  on-processor  problem  should  fit  in  cache.  In  this  situation,  which  only  occurs  for 
small  test  problems  and/or  large  processor  counts,  no  cache  misses  will  happen.  A  rough  estimate  for  the  size  of  the  on-pro- 
cessor  problem  is  made  for  both  DG  and  CG.  To  construct  a  RHS  requires  the  following  data  structures:  (1 )  a  state  vector,  (2) 
reference  vector,  (3)  RHS  vector,  (4)  the  inverse  mass  “matrix”,  and  (5)  nine  metric  terms  and  two  Jacobian  matrices  (for  both 
volume  and  flux  integrals).  The  data  structures  in  items  (l)-(3)  each  have  size  nmrNe(N  +  l)3  with  nvar  =  5,  whereas  struc¬ 
tures  (4)  and  (5)  have  size  Ne(N  +  l)3,  thus  yielding  26 Ne(N  +  l)3  real  numbers.  A  typical  64-bit  machine  uses  8  bytes  to  store 
a  real,  yielding  a  problem  size  of  208 Ne(N  +  l)3.  The  memory  requirements  for  CG  is  slightly  smaller,  since  data  is  stored 
using  a  global,  rather  than  element-based  index;  however  the  magnitude  is  similar.  On  typical  clusters  (e.g.,  TACC’s  Ranger), 
each  core  (and  hence  each  PE)  has  a  cache  with  size  512  KB,  implying  that  a  single  element  can  fit  into  cache  provided 
N  ^  12.  Using  this  estimate,  approximately  20  continuous  elements  fit  into  cache  for  N  =  4,  while  about  3  elements  fit  into 
cache  for  N  =  8. 
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Fig.  4.  Grid  numbering  for  (a)  CG  with  global  indexing  and  (b)  DG  with  local  indexing  for  Ne  =  4  and  N  =  3. 

From  these  estimates,  the  number  of  cache  misses  for  both  CG  and  DG  may  be  estimated.  For  DG,  the  number  of  cache 
misses  may  vary  from  0  to  6.  For  a  very  large  problem  or  a  small  number  of  PEs,  a  cache  miss  will  occur  during  each  flux 
construction,  when  data  from  an  adjacent  processor  is  accessed,  yielding  6  misses.  For  intermediate  sizes,  a  cache  miss  does 
not  occur  when  fetching  data  from  an  element  with  continuous  numbering,  yielding  2  or  4  misses;  finally,  for  very  small 
problems,  all  the  data  is  in  cache,  yielding  no  misses.  For  CG,  the  number  of  misses  ranges  from  0  to  52.  For  large  problems, 
reading  and  writing  any  data  from  an  adjacent  element  will  incur  a  cache  miss,  yielding  2*26  =52  misses.  For  smaller  prob¬ 
lems,  this  count  may  be  reduced  to  48  or  fewer  misses,  depending  on  the  configuration  of  the  sub-domain.  Excluding  sub- 
domains  small  enough  to  fit  into  cache,  the  number  of  cache  misses  associated  with  CG  will  be  many  times  larger  than  the 
misses  associated  with  DG  regardless  of  the  numbering  scheme  used  in  CG  (i.e.,  even  if  element-wise  DG  numbering  is  used 
with  CG). 


5.  Results 

5.1.  Test  cases 

Although  a  standard  set  of  2D  mesoscale  test  cases  has  been  proposed  [34]  and  later  used  within  an  element-based  Galerkin 
framework  [  1 5  ],  an  analogous  suite  of  3D  test  cases  has  not  yet  been  developed.  Fortunately,  a  3D  dynamical  core  can  be  run  in 
2D  mode  by  imposing  symmetry  in  the  y-dimension  and  periodic  boundary  conditions  in  they-lateral  boundaries.  For  initial 
verification  of  NUMA,  we  used  the  test  cases  proposed  in  [34],  followed  by  full  3D  test  cases  with  no  y-symmetry.  In  the  follow¬ 
ing  analysis,  we  consider  two  test  cases:  a  3D  linear  hydrostatic  mountain  and  a  3D  rising  thermal  bubble. 


5.1.1.  3D  linear  hydrostatic  mountain 

To  test  the  3D  capabilities  of  NUMA  and  the  NRBCs,  we  consider  stratified  flow  past  an  isolated  mountain  as  outlined  in 
[35].  An  initial  horizontal  flow  U  =  20  m/s  goes  past  a  mountain  with  orography  given  by 


h{x,y) 


_ ho _ 

(x2/a2  +y2/a2  + 1)3/2 


(41) 


with  mountain  half-width  a  =  10  km  and  height  h0  =  1  m.  The  hydrostatic  background  is  specified  by  a  constant  Brunt- 
Vaisala  frequency  Nbv  =g/^cpT0  with  ground  temperature  T0  =  250  K.  In  other  words,  we  consider  an  isothermal  atmo¬ 
sphere.  No  flux  boundary  conditions  are  imposed  on  the  bottom  of  the  domain,  while  NRBCs  are  imposed  on  the  four  lateral 
boundaries  and  the  top  boundary.  In  order  to  verify  our  numerical  results,  several  analytical  results  from  [35,36]  are  used. 
First,  a  contour  integral  solution  for  the  density  perturbations  p'  valid  under  a  linear  Boussinesq  approximation  is  used.  Also, 
the  velocity  perturbations  parallel  and  perpendicular  to  the  flow  (u'  and  v')  are  known  for  observation  points  near  the 
ground  (z  =  0)  (see  Eqs.  (39)  and  (41 )  in  [35]): 


u'(x,y,  0)  =  hNbv 


x/a 


1  +x2/a2  +y2/a 2 


(42a) 


z/(x,y,0)  =  hNbv 


y/q 

1  +x2/a2  +y2/a2 


under  the  same  approximation. 


(42b) 
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5.1.2.  3D  rising  thermal  bubble 

We  consider  a  3D  buoyant  thermal  bubble  rising  in  a  neutrally  stratified  atmosphere  [39],  which  is  the  3D  extension  of 
the  2D  thermal  bubble  originally  considered  in  [37],  The  hydrostatic  potential  temperature  60(z)  =  300  K  (neutral  atmo¬ 
sphere)  is  perturbed  with  a  sphere  of  radius  rc  =  250  m  centered  at  (xc,yc,zc)  =  (500, 500, 260)  m  defined  by  a  cosine  taper 
given  by 


B'  =  A 


1  +  cos 


(43) 


with  r  =  ||x  —  xcj|2  where  ||  •  ||2  denotes  the  2-norm,  and  A  =  \  is  a  constant.  Unlike  the  test  case  used  in  [39],  our  problem  has 
a  C1  initial  condition,  thus  mitigating  unphysical  oscillations  at  the  interface  of  the  bubble.  The  domain  is  defined  as 
(x,y,z)  e  [0, 1000]3  m.  No-flux  boundary  conditions  are  applied  on  all  six  boundaries. 


5.2.  Numerical  verification 


The  numerical  verification  proceeds  in  three  steps.  In  phase  one,  we  ran  the  code  in  pseudo-2D  mode  using  1  element  and 
periodic  boundary  conditions  in  the  y-direction.  The  numerical  results  for  the  standard  mesoscale  suite  [34]  are  directly 
compared  to  the  results  of  our  existing  2D  model  [15].  In  phase  two,  we  considered  the  linear  isolated  mountain  problem, 
which  possesses  an  approximate  analytical  solution  in  the  form  of  a  contour  integral.  In  phase  three,  we  consider  a  three- 
dimensional  buoyant  convection  problem.  Although  this  problem  does  not  have  an  analytic  solution,  its  qualitative  behavior 
is  well  understood. 


5.2.1.  3D  linear  hydrostatic  mountain 

In  order  to  test  the  3D  operators,  orography,  and  non-reflecting  boundary  conditions,  flow  over  a  3D  isolated  linear  hydro¬ 
static  mountain  is  considered.  This  problem  may  be  solved  under  the  linear  Boussinesq  approximation  via  a  contour  integral 
technique  as  described  in  [36].  In  addition,  closed  form  expressions  for  both  the  down-stream  and  cross-stream  velocity 
components  may  be  used  to  judge  the  numerical  solution  but  only  in  a  qualitative  sense  (i.e.,  since  we  use  the  fully  com¬ 
pressible  equations  then  we  cannot  use  the  linear  Boussinesq  analytic  solution  to  compute  convergence  rates).  In  our  sim¬ 
ulations,  we  used  20  elements  in  the  x  and  y  dimensions  and  10  in  the  z  direction  with  eighth-order  polynomials  yielding 
effective  resolutions  of  Ax  =  Ay  =  1.5  km  and  A z  =  300  m.  The  explicit  time-integrator  was  run  at  the  maximum  allowable 
time-step  for  each  method. 

Fig.  5  compares  the  density  perturbations  p’  of  the  analytic  and  numerical  solutions.  In  panel  (a),  contours  for  p’  (analytic 
are  solid  while  NUMA-CG  are  dashed)  are  shown  after  5  h  of  simulation  time  and  compared  to  the  contour  integral  solution; 
after  5  h  the  models  have  converged  to  steady-state.  The  results  of  NUMA-DG  are  similar  to  the  dashed  contours  in  Fig.  5(a). 
Panel  (b)  of  this  figure  compares  the  numerical  results  produced  by  NUMA-CG,  NUMA-DG,  and  Smith’s  analytical  model  along 
the  line  x  =  y  =  1 20, 000  m.  Agreement  is  very  good  near  the  mountain,  while  the  two  solutions  begin  to  deviate  as  z  increases 
due  to  the  influence  of  the  sponge.  Agreement  between  NUMA  and  Smith’s  results  may  be  brought  into  closer  agreement  by 
increasing  the  resolution  of  the  model;  at  least  this  is  the  case  in  the  2D  analog  of  the  problem  (see  [15,16]).  Recall  that  the 
“analytic"  solution  used  here  is  for  a  Boussinesq  model  and  therefore  one  should  not  expect  to  get  exact  agreement.  Additional 
verification  is  performed  by  comparing  the  velocity  on  the  surface  of  the  mountain.  The  down-stream  velocity  perturbation  u' 


Fig.  5.  Comparison  of  the  density  perturbations  p'  for  the  isolated  linear  hydrostatic  mountain  using  NUMA  (solid  line)  and  a  contour  integral  solution 
(dashed  line)  [36],  In  panel  (a),  the  p'  contours  in  the  plane  y  =  120, 000  m  are  shown  (the  analytical  model  are  solid  while  the  CG  model  are  dashed  -  the 
DG  results  are  similar  to  those  for  CG  and  are  not  shown).  In  panel  (b),  p‘  is  shown  as  a  function  of  z  using  x  =  y  =  120, 000  m  using  numerical  results 
produced  by  the  CG  and  DG  models,  and  Smith’s  analytical  model. 
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and  cross-stream  velocity  perturbation  v'  at  the  ground  for  both  the  CG  and  DG  models  are  compared  with  the  analytical  for¬ 
mulas  in  Eq.  (42).  Fig.  6  displays  the  results  of  this  comparison  after  t  =  5  h  of  integration.  Agreement  between  the  CG  and  DG 
models  and  the  analytical  formulas  given  by  Eq.  (42)  is  satisfactory,  especially  for  the  cross-stream  velocity  perturbation.  We 
attribute  the  slight  deviation  of  the  down-stream  velocity  to  the  sponge  layer  in  the  model,  which  drives  the  total  down¬ 
stream  velocity  to  a  constant  velocity  of  20  m/s  at  the  boundary  of  the  computational  domain.  This  additional  forcing  term 
affects  the  quality  of  the  solution  near  the  sponge  layer.  In  fact,  for  this  particular  case,  the  NRBCs  dominate  the  solution;  a 
similar  behavior  is  seen  in  the  2D  mountain  cases  in  the  NUMA2D  results  presented  in  [15], 

5.2.2.  3D  rising  thermal  bubble 

Finally,  the  results  of  the  buoyant  convection  experiment  are  shown  in  Figs.  7  and  8.  Numerical  results  using  both  NUMA- 
CG  and  NUMA-DG  are  shown.  We  used  a  total  of  103  elements  with  N  =  8  polynomials,  yielding  an  effective  resolution  of 
12.5  m.  Due  to  the  coarse  resolution  of  this  run,  a  small  amount  of  artificial  diffusion  v  =  0.5  m2/s  was  used  to  suppress  grid 
noise.  In  the  CG  simulation,  a  time-step  of  At  =  0.01  s  was  used,  whereas  DG  required  At  =  0.005.  A  smooth  potential  tem¬ 
perature  perturbation  in  an  initially  neutral  atmosphere  generates  vertical  updrafts  and  shear  velocities,  which  cause  the 
bubble  to  rise,  deform,  and  transition  to  turbulence.  Fig.  7  displays  x-z  cross-sections  of  the  potential  temperature  perturba¬ 
tion  for  t  =  0,  200,  and  400  s,  while  Fig.  8  displays  a  1 D  cross-section  along  the  line  x  =  y  =  500  m  for  both  CG  and  DG.  Note 
that  the  CG  result  displays  significant  Gibbs  oscillations,  while  the  Gibbs  oscillations  are  less  pronounced  in  the  DG  code;  this 
behavior  is  expected,  since  DG  is  known  to  capture  sharp  gradients  more  effectively.  The  rising  thermal  bubble  problem  tests 
the  models’  ability  to  capture  the  effects  of  turbulence  and  turbulent  convection.  Although  no  analytical  solution  exists  for 
this  problem,  the  numerical  results  are  physically  plausible  and  resemble  previous  3D  bubble  experiments  in  [39,1].  Similar 
results  are  seen  for  both  the  NUMA-CG  and  NUMA-DG  codes. 

For  this  test  case,  the  relative  conservation  of  both  mass  (M)  and  energy  (E)  for  both  CG  and  DG  were  calculated  to  be 
M  »  10  13  and  E  &  10  1 .  In  other  words,  both  CG  and  DG  conserve  mass  to  a  level  approaching  machine  precision,  whereas 
energy,  which  is  not  a  conservation  variable  in  either  model,  is  not  conserved.  Flence,  there  are  no  particular  disadvantages  to 
using  a  non-conservative  form  of  the  equations  within  the  CG  model. 

5.3.  Parallel  performance 

Both  NUMA-CG  and  NUMA-DG  have  been  deployed  on  TACC’s  Ranger  Sun  Constellation  cluster  [3].  It  is  important  to 
understand  that  there  may  exist  more  optimal  parallel  implementations  and,  therefore,  our  results  are  illustrative  of  two 
very  specific  choices  for  constructing  parallel  algorithms  (as  described  in  Section  4).  However,  the  specific  choices  we  have 
made  in  our  parallel  implementation  are  either  standard  or  simple  in  that  they  represent  the  most  obvious  ways  of  con¬ 
structing  parallelization  strategies  for  both  the  CG  and  DG  methods. 

Two  test  problems  using  the  rising  thermal  bubble  are  executed  using  a  fixed  grid  comprised  of  323  =  32, 768  elements 
with  both  N  =  4  and  N  =  8  polynomials.  The  N  =  4  simulation  has  an  effective  resolution  of  7.8  m  in  both  the  horizontal  and 
vertical,  while  for  N  =  8  the  effective  resolution  is  3.9  m.  The  maximum  allowable  stable  time  step  was  used  for  both  the  CG 
and  DG  models  (note  that  the  CG  model  uses  a  time-step  twice  as  large  as  DG  and  the  N  =  4  simulations  allow  a  time-step 


x  (km) 

(a)  down-stream  velocity 


y  (km) 

(b)  cross-stream  velocity 


Fig.  6.  Comparison  of  the  (a)  down-stream  velocity  perturbation  u1  and  (b)  cross-stream  velocity  perturbation  z/  for  the  isolated  mountain  at  ground  level. 
Agreement  between  the  CG  and  DG  models  and  the  analytical  formulas  given  by  Eq.  (42)  is  satisfactory,  especially  for  the  cross-stream  velocity 
perturbation. 
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(d)  t  =  200  s  (CG) 
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(e)  t  =  400  s  (DG)  (f)  t  =  400  s  (CG) 

Fig.  7.  Evolution  of  a  3D  rising  thermal  bubble  problem  ( x  -  z-slices  of  the  potential  temperature  perturbation  S')  in  they  =  500  m  plane  for  t  =  0,  200,  and 
400  s  using  both  NUMA-CG  and  NUMA-DG.  A  grid  consisting  of  Ne  —  103  elements  with  N  =  8  polynomials  is  used. 


twice  as  large  as  those  for  N  =  8).  Due  to  limitations  on  the  available  computational  resources,  only  one  run  at  each  processor 
count  was  feasible.  Also,  due  to  the  size  of  this  problem,  running  the  model  in  serial  was  impractical;  therefore,  the  wallclock 
time  for  each  method  at  16  cores  is  assumed  to  be  16  times  the  serial  wallclock  time. 
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Fig.  8.  Numerical  results  for  rising  thermal  bubble  problem  along  the  line  x  =  y  =  500  m  using  both  NUMA-CG  and  NUMA-DG. 


5.3.  J.  Results  for  4th  and  8th  order  polynomials 

Fig.  9  displays  both  (a)  the  wallclock  time  and  (b)  speedup  for  the  CG  and  DG  methods  using  N  =  4.  In  panel  (b),  the  ideal 
speedup  (perfect  scalability)  is  also  shown.  Both  methods  exhibit  nearly  perfect  linear  scaling  to  2048  cores.  Beyond  2048 
cores,  however,  the  scaling  of  CG  begins  to  decline  while  DG  continues  to  scale  ideally  up  to  8192  cores.  At  16,384  cores, 
the  DG  scalability  plateaus.  Let  us  now  see  what  happens  when  we  increase  the  polynomial  order. 

Fig.  10  displays  the  same  data  for  N  =  8.  Both  CG  and  DG  exhibit  nearly  perfect  linear  scaling  to  512  cores.  Between  512 
and  2048,  CG  loses  perfect  scaling  but  recovers  beyond  2048  cores.  Note  that  DG  exhibits  perfect  linear  scaling  all  the  way 
through  to  32,768  processors  which  is  the  maximum  number  of  processors  that  a  323  element  simulation  can  use.  At  this 
point,  each  PE  contains  only  one  element  demonstrating  the  strong  scalability  of  the  method. 


5.3.2.  Communication  cost 

The  difference  in  the  scaling  behavior  between  the  two  methods  can  be  explained  by  examining  the  relative  communi¬ 
cation  footprints  of  both  CG  and  DG.  In  3D,  each  CG  processor  element  (PE)  may  have  up  to  26  neighbors,  whereas  DG  has 
only  6;  hence,  NUMA-DG’s  communication  footprint  is  more  than  4  times  as  compact  as  NUMA-CG.  In  addition,  DG  has  a 
more  local  memory  access,  since  DG  stores  data  on  an  element-by-element  basis.  Therefore,  only  the  flux  computations 
require  non-local  memory  fetches;  in  contrast,  CG  uses  a  global  index,  and  requires  many  more  non-local  memory  accesses 
although  this  overhead  can  be  eliminated  if  DG-type  local  indexing  is  used. 

Effect  of  polynomial  order.  Comparing  Figs.  9  and  10,  we  also  see  the  dependence  of  scalability  on  the  order  of  the  method. 
The  number  of  grid  points  for  a  high-order  EBG  method  is  0(NeN 3),  whereas  the  computation  volume  is  0[NeN4y,  hence, 


(a)  wallclock  time  (b)  speedup 

Fig.  9.  Scaling  study  performed  using  the  NUMA-CG  and  NUMA-DG  codes  for  the  RTB  problem  using  323  elements  and  N  =  4  polynomials.  The  wallclock 
time  refers  to  the  total  simulation  time. 
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(a)  wallclock  time 


(b)  speedup 


Fig.  10.  Scaling  study  performed  using  the  NUMA-CG  and  NUMA-DG  codes  for  the  RTB  problem  using  323  elements  and  N  =  8. 


the  computation  density  is  0(N).  In  contrast,  the  communication  volume,  as  described  in  Section  4.4,  grows  only  as 
o(N2e/3N2y  Hence,  as  we  increase  the  order  JV,  the  computation  density  grows  as  N  whereas  the  communication  density 
(ratio  of  communication  volume  to  the  number  of  grid  points)  decreases  as  1/N.  From  a  computer  architecture  viewpoint, 
the  data  locality  for  both  CG  and  DG  increases  with  order,  which  is  advantageous  for  both  distributed  memory  implementa¬ 
tions  and  shared-memory  implementations  using  graphical  processor  units  (GPUs)  [26].  A  high-order  method,  while  more 
expensive  at  low  processor  counts,  becomes  more  efficient  as  the  processor  count  increases. 

For  example,  at  16  PEs,  DG  requires  650  s  of  wallclock  time  at  JV  =  4  and  15,000  s  at  N  =  8  (both  using  323  elements), 
yielding  a  ratio  of  23.7  for  the  wallclock  times,  i.e., 


R\6 


WTn=s\ 

WTn=4/  nproc=  16 


23.7 


where  WT  is  the  wallclock  time.  In  contrast,  at  8182  PEs,  the  same  simulation  requires  1.5  s  at  JV  =  4  and  25  s  at  JV  =  8,  for  a 
ratio  of  Rg192  =  16.6;  therefore  going  from  16  to  8192  processors  has  decreased  the  computational  cost  ratio  between  the  4th 
and  8th  order  simulations.  We  attribute  this  phenomena  to  the  low  communication  volume  and  increased  data  locality  (e.g., 
greater  on-processor  work  per  grid  point)  achieved  by  DG  as  order  JV  increases.  Of  course,  the  same  arguments  hold  for  CG 
although  the  curves  are  not  as  straight  as  they  are  for  DG. 

Note  that  for  a  fixed  number  of  elements  (in  this  case  323  elements)  and  increasing  the  polynomial  order  from  N  =  4  to 
JV  =  8  increases  the  size  of  the  problem  by  a  factor  of  32:  the  number  of  grid  points  increases  by  a  factor  of  8  and  the  number 
of  time-steps  by  a  factor  of  4.  Although  the  JV  =  8  simulation  is  32  times  larger  than  the  N  =  4  simulation,  the  additional  cost 
of  increasing  the  order  JV  decreases  as  we  move  to  larger  core  counts.  For  example,  the  minimum  wallclock  time  in  Fig.  9  for 
JV  =  4  is  1.5  s  (corresponding  to  8192  processors)  while  the  minimum  wallclock  time  in  Fig.  10  for  JV  =  8  is  7  s  (correspond¬ 
ing  to  32,768  processors).  In  this  case,  we  see  that  the  cost  ratio  between  8th  and  4th  order  is  now  4.7  even  though  the  ratio 
of  the  problem  size  is  32. 

Complexity  analysis:  a  geometrical  view.  Another  way  of  interpreting  the  results  in  Figs.  9  and  10  involves  quantifying  the 
operation  count  that  takes  place  on-processor  versus  the  size  of  the  message  that  has  to  be  communicated  across  processors. 
Geometrically,  this  can  be  viewed  as  the  ratio  between  the  computational  volume  (volume  of  the  PE)  and  communication 
footprint  (surface  area  of  the  PE).  To  simplify  the  following  discussion  let  us  first  define  the  computational  volume  on  a 
specific  hexahedral  processor  element  (PE)  as 

VPE  =  o[Nenmr(N+-l)4^ 


while  the  surface  area  of  this  HPE  that  defines  the  communication  cost  is 


SPE  =  o(N2J3nmr(N  +  l)2) 

where  Ne  denotes  the  number  of  elements  per  PE,  nvar  are  the  number  of  variables  in  our  equations  ( nvar  =  5  in  our  case),  and 
JV  denotes  the  order  of  the  polynomial;  these  relations  come  from  Eq.  (38).  Note  that  the  term  N2/3  represents  the  number  of 
faces  of  a  PE  (i.e.,  the  convex  hull  of  the  PE).  Using  these  two  relations,  we  can  now  define  the  ratio 

74  =  ^  =  o  (Ne  [PE\ v /3  (JV  +  1  )2) 

■JPE  V  / 
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where  Ne[PE]  denotes  that  (for  a  fixed  grid)  the  number  of  elements  per  processor  Ne  is  a  function  of  the  number  of  proces¬ 
sors  used  (PE);  this  ratio  we  now  use  to  compare  Figs.  9  and  10. 

Fig.  9  shows  that  for  N  =  4  both  CG  and  DG  lose  perfect  scalability  between  8192  and  1 6,384  processors.  Using  a  grid  with 
323  total  elements  we  can  determine  that  8192  processors  gives  Ne  =  4  and  for  16384  processors  yields  Ne  =  2.  From  these 
values  we  determine  that  =  39  and  72.46384  =  31.  In  contrast  for  N  =  8,7£j]192  =  127  and  1Z36384  =  101  and  scalability  is 
maintained  (Fig.  10).  This  tells  us  that  for  the  entire  range  of  possible  number  of  processors,  for  N  =  8,  the  ratio  of  the  on- 
processor  work  to  inter-processor  communication  is  ^  100  which  means  that  the  communication  is  overwhelmed  by  the 
computation  and  we  therefore  see  perfect  scalability.  The  JV  =  4  results  tell  us  that  a  factor  of  40  or  less  is  not  sufficient 
to  maintain  scalability.  Looking  again  at  Fig.  9  we  see  that  for  N  =  4  we  maintain  perfect  linear  scalability  for  both  CG 
and  DG  at  2048  or  fewer  processors.  For  these  values  we  determine  that  7£2048  =  62  so  we  can  conjecture  that  the  factor 
of  on-processor  work  to  inter-processor  communication  can  be  as  low  as  62  while  still  maintaining  scalability. 

5.3.3.  Memory  access 

The  previous  discussion  has  focused  on  the  effects  of  load-balancing  and  communication  on  scaling.  However,  memory 
access  must  also  be  considered  in  order  to  achieve  maximum  parallel  efficiency.  In  particular,  we  wish  to  fit  the  local  prob¬ 
lem  into  the  fast  cache  on  each  compute  core  in  order  to  minimize  the  number  of  fetches  from  random  access  memory  (RAM), 
since  each  fetch  may  consume  hundreds  of  clock  cycles.  To  determine  the  core  count  at  which  the  local  problem  fits  into 
cache,  consider  that  the  approximate  size  of  each  state-vector  for  the  global  problem  (in  bytes)  for  DG,  which  uses  an 
element-local  storage  scheme,  is  approximately  8n„arNf(N+  l)3,  where  the  coefficient  8  is  the  size  of  a  real  number  and 
Nf  is  the  global  number  of  elements.  Since  we  need  to  store  both  the  state-vector  and  a  RHS,  the  total  size  of  the  global  prob¬ 
lem  for  N  =  8,Nf  =  323  and  nmr  =  5  is  approximately  2  GB.  The  size  of  the  global  problem  for  CG  is  slightly  smaller  due  to  a 
global  storage  scheme,  although  the  difference  is  small  for  high  polynomial  orders.  Each  quad-core  processor  on  Ranger  has 
6  MB  of  shared  L3  cache  and  each  core  has  512  KB  of  dedicated  L2  cache.  Hence  the  total  cache  per  core  on  Ranger  is  2  MB. 
Assuming  good  load  balancing,  the  local  problem  will  fit  into  cache  using  1000  cores.  Performing  the  same  calculation  for 
N  =  4,  we  see  the  local  problem  will  fit  into  cache  using  approximately  160  cores.  Therefore,  we  expect  to  see  a  super-linear 
speedup  at  these  core  counts,  which  is  evident  in  Figs.  9  and  10  (see  panels  b)  where  the  CG  and  DG  simulations  exceed  the 
ideal  scaling  curve. 


6.  Conclusion  and  future  work 

6.1.  Conclusion 

In  this  paper,  we  have  developed  a  nonhydrostatic  model  of  the  atmosphere  (NUMA)  based  on  both  CG  and  DG  discreti¬ 
zations  in  space  and  explicit  discretization  in  time.  We  note  that  purely  explicit  time-discretization  is  not  practical  for 
operational  purposes  due  to  the  stringent  CFL  restriction;  therefore,  we  will  present  the  results  of  our  implicit-explicit 
time-integrators  in  future  work.  This  paper  has  subjected  NUMA  to  a  couple  of  limited-area  simulations,  including  orographic 
flow  and  buoyant  convection  problems.  The  results  of  these  test  problems  are  in  agreement  with  either  previous  simulations, 
analytical  results,  or  physical  intuition. 

The  parallel  performance  study  described  in  Section  5  suggests  several  directions  for  the  future  of  EBG  methods  (in  par¬ 
ticular,  DG  methods).  Firstly,  good  scalability  was  exhibited  for  both  the  CG  and  DG  codes  for  C>(104)  cores,  with  near-linear 
strong  scaling  for  the  DG  code  beyond  0(1 04).  As  described  in  Sections  4.4  and  4.5,  we  attribute  these  scalability  results  to  a 
large  computational  volume  relative  to  communication  volume  and  high  data  locality,  which  yields  a  low-cache  miss  per¬ 
centage.  In  particular,  DG  exhibits  low  communication  volume  and  outstanding  data  locality,  indicating  that  DG  will  be  able 
to  achieve  ideal  scaling  up  to  0(1O5)  cores  with  no  modifications.  Secondly,  these  experiments  indicate  that  high  order  EBGs 
(e.g.  N  =  8  relative  to  N  =  4)  become  more  efficient  at  higher  core  counts.  Hence,  as  we  move  towards  the  exascale  epoch  in 
which  core  counts  of  0(1O5)  and  O(106)  are  rapidly  become  available,  these  high  order  EBGs  may  become  competitive  with 
their  established  low-order  counterparts  (finite  element  and  finite  volume  methods).  For  these  reasons,  high-order  EBG 
methods  are  excellent  candidates  for  next-generation  NWP  models. 

6.2.  Future  work 

In  conjunction  with  the  Naval  Research  Laboratory-Monterey,  we  are  incorporating  microphysical  parameterizations  into 
NUMA.  Preliminary  experiments  using  the  Kessler  scheme  [23]  have  been  conducted  within  a  2D,  serial  implementation  [13]. 
Since  physical  parameterizations  operate  on  columns  of  data  independently  of  adjacent  data,  the  problem  is  embarrassingly 
parallel  provided  the  domain  is  decomposed  in  the  horizontal  only  such  that  all  z  values  reside  on-processor.  To  facilitate  scal¬ 
ing  on  hybrid  shared-distributed  memory  architectures  (such  as  TACC’s  Ranger),  hierarchical  domain  decomposition  is  desir¬ 
able,  whereby  MPI  communication  based  on  the  algorithm  developed  in  the  present  paper  is  used  in  the  horizontal  and  either 
OpenMP  parallelization,  appropriate  for  shared  memory,  or  graphical  processor  units  (GPUs)  are  employed  for  fine-grained 
parallelism  in  the  vertical.  For  such  problems  where  the  domain  decomposition  is  only  performed  in  the  xy  plane  such  that 
all  z  values  are  on-processor  both  the  CG  and  DG  methods  will  always  have  a  much  larger  computational  volume  versus 
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communication  footprint  that  will  allow  both  methods  to  scale  perfectly  for  the  maximum  number  of  processors  accommo¬ 
dated  by  a  2D  domain  decomposition.  In  future  work,  we  will  explore  hybrid  parallelization  strategies,  implicit-explicit 
(IMEX)  time-integrators  (along  with  iterative  solvers  and  preconditioners),  and  adaptive  mesh  refinement.  The  goal  of  all 
of  these  topics  is  to  further  improve  the  efficiency  of  the  NUMA  model. 
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